AN INVESTIGATION OF A SCORING PROCEDURE DESIGNED TO
ELIMINATE SCORE VARIANCE DUE TO GUESSING IN MULTIPLE-CHOICE TESTS 



Introduction 

The conventional multiple-choice response mode requires the
examinees to identify and mark the correct choice to each item. 
The most direct method of scoring such responses is number-right 
scoring, whereby every correctly marked choice in an item is as- 
signed a score of one and each other choice is assigned a score of 
zero. A major limitation of the foregoing procedure is that the 
examiner is unable to determine whether a correct response is the 
result of sufficient knowledge to answer the question correctly 
or whether a correct response is a successful guess among two, 
three, or more choices. Although it is generally agreed that some 
attempt should be made to control the effect of guessing, to date 
few scoring methods other than the conventional correction for 
guessing have been proposed that explicitly attempt to do so, and 
each method has its critics. 

The conventional correction for guessing simply involves
subtracting from number-right scores a quantity reflective of the
number of items answered incorrectly. The fact that this procedure
is not entirely satisfactory is evident from the numerous studies
that have argued for or against it. (See Cross, 1973,
or Diamond and Evans, 1971, for a critical review of these studies.)

Coombs (1953) proposed an alternative test-taking procedure
designed to assess partial information. With this procedure, exam- 
inees are directed to mark only those choices they are certain are 
incorrect and to leave the correct choice unmarked. A scoring rule 
presumed to ensure this type of behavior is imposed. On an item
having C choices, one point is awarded for each distracter marked,
but a score penalty of (C - 1) is imposed if the correct choice is 
marked. Thus, an examinee is able to express and receive credit 
for partial information but will be severely penalized if he 
erroneously marks the correct choice as a distracter. Consequently, 
guessing under these conditions is not a profitable game to play 
as suggested by Coombs, Milholland and Womer (1956) and by Lord
and Novick (1968, p. 315). Several studies have investigated the 
effect this response mode and scoring procedure have on the
reliability and validity of the resulting scores (Coombs et al., 1956;
Collet, 1971; Koehler, 1972). The results of these studies suggest 
that the reliability and validity of the scores can be expected to 
improve or show no difference when compared to other scoring pro- 
cedures. Aside from the effect this procedure has on reliability 
or validity, from a logical standpoint, it would seem that the 
elimination scoring procedure will inhibit guessing behavior more 
effectively than any other testing procedure. Two major drawbacks
of this procedure, however, are the additional time required to
administer a test and the inconvenience of having to train examinees
in the use of this response mode every time the test is
administered. The Coombs response mode was selected for use in the
present study to establish criterion score sets for the experimental 
test which would reflect varying degrees of guessing. 
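
The elimination scoring rule described above can be sketched as follows. This is a minimal illustration, not the scoring program used in the study: the function names and the set-based representation of an examinee's marks are assumptions made for the example, and the number of choices is passed in as a parameter.

```python
from fractions import Fraction


def coombs_item_score(marked, correct, n_choices):
    """Score one item under the elimination rule.

    marked    -- set of choice labels the examinee marked as incorrect
    correct   -- label of the keyed (correct) choice
    n_choices -- C, the number of choices on the item
    """
    # One point for each distracter the examinee eliminated.
    distracters_marked = len(set(marked) - {correct})
    # A penalty of (C - 1) if the keyed choice was eliminated by mistake.
    penalty = (n_choices - 1) if correct in marked else 0
    return distracters_marked - penalty


def expected_score_blind_mark(n_choices):
    """Expected score from marking a single choice completely at random."""
    p_hit_key = Fraction(1, n_choices)
    return (1 - p_hit_key) * 1 - p_hit_key * (n_choices - 1)
```

Under this reading of the rule, marking one choice blindly has an expected score of zero for any C, which is the sense in which guessing is not profitable.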

The present study was designed to investigate a novel scoring 
system that would make it possible to provide scores that closely 
approximate those that (a) are free from the guessing component, 
or (b) include a controlled guessing component as initially deter- 
mined by use of the Coombs response mode. The proposed scoring 
system is designed to be used in conjunction with the standard 
response mode, and it does not require directions admonishing the 
examinees to refrain from guessing. Consequently, it would offer 
a distinct advantage over present scoring procedures which either 
employ directions that attempt to discourage guessing behavior and 
which may have an adverse effect for cautious examinees, or require 
the examinees to be trained in the uses of an alternate response 
mode every time the test is administered.

Data Collection 

A series of three teacher-written algebra tests were administered 
to 12 sections of eleventh-grade students attending a suburban Phila- 
delphia high school. The six participating teachers agreed to use 
the scores from these tests for grading purposes, and the students 
were so informed, thus insuring a conscientious effort. The exam- 
inees were directed to respond to each of the tests in two distinct 
ways: using the Coombs response mode, which was used with
appropriate directions during the first part of the testing period;
and using the conventional response mode, which was used with 
directions that encouraged guessing during the second part of the 
test period. Two initial tests were designed to acquaint the 
students with the novel response mode and to provide feedback 
on their performance. Only the data from a third test (n = 230) 
were used for the experimental analysis. The test consisted of 
20 multiple-choice items with four choices per item. 

In addition to the scores resulting from the experimental test, 
final course grades and scores on the final examination were
obtained for each examinee to be used as "external" validity criteria.

Data Analysis 

Three different scoring procedures were used to score the
"conventional" responses made during Part II of the test administration.
First, number-right scores (NR) were computed by assigning a score 
credit of one to every item for which the correct choice was
indicated and a score credit of zero to all other items. Second, a
corrected for guessing score (NRC) was computed by subtracting from 
each examinee's NR score an amount equal to one-fourth of the number 
of items wrong. It should be noted that the NR and NRC scores are 
both derived from the responses made when the examinees were directed 
to indicate the correct choice with no penalty for guessing.
Consequently, the NRC scores do not reflect the influence of the
directions not to guess that usually accompany formula scoring.
Finally, the conventional responses were used as a basis for the
proposed scoring system.
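
The two conventional scores can be sketched as below. This is an illustration under stated assumptions rather than the study's scoring program: responses are represented as lists of choice labels, an omitted item is represented by `None` and is assumed not to count as wrong, and the function names are hypothetical. The one-fourth fraction is the value given in the text.

```python
def number_right(responses, key):
    """NR: one point for each item whose marked choice matches the key."""
    return sum(1 for r, k in zip(responses, key) if r == k)


def number_wrong(responses, key):
    """Items answered incorrectly; omitted items (None) are not counted."""
    return sum(1 for r, k in zip(responses, key) if r is not None and r != k)


def corrected_score(responses, key, fraction=0.25):
    """NRC: NR minus a fixed fraction of the number of items wrong."""
    return number_right(responses, key) - fraction * number_wrong(responses, key)


key = list("abcdabcd")
responses = ["a", "b", "c", "a", None, "b", "d", "d"]
nr = number_right(responses, key)      # five items match the key
nrc = corrected_score(responses, key)  # two wrong items cost 0.25 each
```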

By considering simultaneously the responses made under both 
the conventional and Coombs response mode conditions, it was possible 
to compute several sets of scores, each based on the number of items 
answered correctly when the number of correctly guessed items is 
controlled. The number of choices among which guessing occurred 
is the distinguishing feature of these score sets. Guessing-free
scores were obtained by assigning a score credit of one to an item 
if it was answered correctly, provided that all four distracters 
were identified. A score credit of zero was assigned to all other 
items. Thus, these guessing-free (GF) scores came only from items
where it appeared the examinee knew the answer with a substantial 
degree of assurance. 

A second set of scores was computed by assigning a score of 
one to all items from the GF score set and also to items for which 
successful guessing was limited to two choices (GF-2 scores). Two
more partially GF score sets were computed in an analogous manner 
for which successful guessing was limited at most to three and four 
choices (GF-3 and GF-4). It should be noted that the number of
items answered correctly when guessing is free to vary among all 
choices is simply the number of items answered correctly, or the 
NR score. 

The scoring rule proposed by Coombs (1953) was also used to
score the elimination mode responses yielding yet another set of 
scores simply referred to hereafter as Coombs scores. 
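
One way to read the guessing-free and partially guessing-free score sets is sketched below. The operationalization is an assumption made for illustration: the number of choices among which guessing occurred is taken to be the number of choices left unmarked under the Coombs response mode, and credit is withheld if the keyed choice was itself marked as a distracter. The function and parameter names are hypothetical.

```python
def gf_item_credit(answer, marked, correct, n_choices, max_guess_set=1):
    """Return 1 if the item earns credit, else 0.

    answer        -- choice marked under the conventional response mode
    marked        -- set of choices eliminated under the Coombs mode
    correct       -- keyed choice
    n_choices     -- number of choices on the item
    max_guess_set -- largest number of unmarked choices still credited
                     (1 gives GF; 2, 3, ... give GF-2, GF-3, ...)
    """
    if answer != correct or correct in marked:
        return 0
    unmarked = n_choices - len(marked)
    return 1 if unmarked <= max_guess_set else 0


def gf_score(answers, marked_sets, key, n_choices, max_guess_set=1):
    """Sum item credits over a whole test for one examinee."""
    return sum(
        gf_item_credit(a, m, k, n_choices, max_guess_set)
        for a, m, k in zip(answers, marked_sets, key)
    )


key = ["a", "b", "c"]
answers = ["a", "b", "c"]
marked_sets = [{"b", "c", "d"}, {"a", "c"}, set()]
scores = [gf_score(answers, marked_sets, key, 4, k) for k in (1, 2, 3, 4)]
```

With this reading the score sets are nested, so each examinee's score can only stay the same or rise as the permitted guessing set grows.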

The experimental scoring procedure requires that scores be 
calculated on a set of variables for each examinee. The operational 
definitions for the six basic variables are presented in Table 1. 
The square of each of these variables and the cross-product between 
each pair were then computed. This resulted in a total of 27 
variables. These variables were then used to predict the guessing-free
and each of the partially guessing-free score sets outlined
above. The forward selection program of the Statistical Package 
for the Social Sciences (Nie, Bent and Hull, 1970) was used for 
this purpose. 
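
The count of 27 predictors follows from 6 basic variables, 6 squares, and 15 pairwise cross-products. A minimal sketch of that expansion is given below; the dictionary representation and the naming of the derived variables are assumptions made for the example, and the numeric values are invented.

```python
from itertools import combinations


def expand_predictors(basic):
    """Add squares and pairwise cross-products to a dict of basic variables."""
    names = sorted(basic)
    expanded = dict(basic)
    for name in names:
        expanded[f"{name}^2"] = basic[name] ** 2
    for a, b in combinations(names, 2):
        expanded[f"{a}*{b}"] = basic[a] * basic[b]
    return expanded


# Hypothetical values for one examinee on the six basic variables.
basic = {"V1": 0.5, "V2": 0.2, "V3": 0.1, "V4": -0.1, "V5": 0.6, "V6": 0.04}
expanded = expand_predictors(basic)
```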

Because the proposed scoring procedure uses a multiple regression 
technique involving a large number of predictor variables, cross-validation
of the predicted guessing-free scores was essential.
To this end, the 230 answer sheets were randomly separated into 
two groups (groups A and B) and the b weights associated with the 
variables entering the prediction equation in each group were applied 
to the scores for the same variables in the alternate groups. The 
guessing-free and each of the partially guessing-free score sets
served as the criterion for a separate regression analysis.
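
The cross-validation design described above can be sketched as follows. This is an illustration only: it fits ordinary least-squares weights on all supplied predictors, whereas the study used a forward selection program, and the function names, the random seed, and the synthetic data in the test are assumptions.

```python
import random


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def fit_weights(rows, y):
    """Least-squares b weights (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(weights, rows):
    """Apply intercept-first weights to each row of predictor values."""
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], r)) for r in rows]


def double_cross_validate(rows, y, seed=0):
    """Split at random into two groups; fit in each, apply to the other."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    groups = (idx[:half], idx[half:])
    results = []
    for fit_on, apply_to in (groups, groups[::-1]):
        w = fit_weights([rows[i] for i in fit_on], [y[i] for i in fit_on])
        pred = predict(w, [rows[i] for i in apply_to])
        results.append(pearson(pred, [y[i] for i in apply_to]))
    return results
```

The two returned correlations correspond to weights derived in one group and applied to the other, in each direction.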

The utility of the proposed scoring system for predicting each 
criterion was judged by comparing magnitudes of the correlation 
coefficients between the cross-validated scores and each of the
guessing-free criterion scores with the correlation coefficients
between scores yielded by two conventional methods of scoring 
and the same criterion. 

Since the expressed purpose of the proposed scoring system 
was to yield a set of scores free of a guessing component, the 
guessing-free (GF) score set was the criterion of central
importance.

The proposed scoring system was also used to predict the two 
external validity measures directly. The correlation of the cross- 
validated scores with the final-course and final-examination grades 
was compared to the parallel correlations for the NR and NRC score
sets. Although there was no reason to expect the proposed system
to be superior in predicting these measures, it was of interest to
determine the ability of the scoring variables to predict validity
measures that exist in many practical testing situations.

Results 

The means, standard deviations, and intercorrelations between 
the score sets generated from the experimental test are presented in
Tables 2 and 3 for group A and for group B, respectively. Included in 
these tables are reliability estimates as well as the correlation of 
each test score set with the two external criteria. The matched-half 
reliability coefficients were computed by means of the Rulon formula 
with a special splitting of the items to form halves that would
measure, as nearly as possible, the same content area. There is
no reliability estimate provided for the NRC scores. Because of the
unusual format of the answer sheets, individual item scores could not
be computed to provide a reliability estimate from a single test 
administration. Moreover, these scores reflect only the application 
of the correction formula and cannot be interpreted the same as 
corrected scores given with directions appropriate to them. 
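
The Rulon split-half estimate is one minus the ratio of the variance of the half-score differences to the variance of the total scores. A minimal sketch follows; the half scores are invented for illustration, and the choice of population variance is an assumption that does not affect the result, since the same divisor appears in both variances.

```python
from statistics import pvariance


def rulon_reliability(half_a, half_b):
    """Rulon estimate: 1 - var(half difference) / var(total score)."""
    diffs = [a - b for a, b in zip(half_a, half_b)]
    totals = [a + b for a, b in zip(half_a, half_b)]
    return 1 - pvariance(diffs) / pvariance(totals)


# Hypothetical half-test scores for five examinees.
half_a = [5, 7, 3, 8, 6]
half_b = [4, 7, 2, 9, 5]
r = rulon_reliability(half_a, half_b)
```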

The descriptive statistics in these tables are presented to 
provide some insight into the nature of the scores being predicted 
or compared in the following sections. 

The results of the regression analyses showed that the relative 
worth of the variables for predicting each of the criteria (GF, 
GF-2, GF-3, GF-4) was quite different when compared across
groups. 

The cross-validated scores from the proposed scoring system 
were compared to the scores resulting from number-right and formula 
scoring of the same responses. The correlations of these scores with
the GF and partially GF scores are presented in Table 4. In every
case, the NR and NRC scores correlated more highly with each criterion
than did the cross-validated scores. 

The ability of the proposed scoring system to predict the two 
external score sets was compared to number-right and formula scoring 
as was done for the other criteria. The observed correlations are 
presented in Table 5. Inspection of Tables 4 and 5 shows the
correlation between the cross-validated scores and each criterion 
to be lower than the correlation between the NR or NRC scores with 
the same criteria. 

Discussion 

With the exception of one variable (per cent correct), 
each of the score variables used in the proposed scoring procedure 
is based on one of two basic statistics; namely, the proportion of 
examinees that selected each choice and the point-biserial corre- 
lation coefficient between the dichotomy of marking or not marking 
each choice and total scores on the test in which the choice is
included. The use of such item/choice statistics to assign item
scores is not novel. A scoring procedure proposed by Chernoff 
(1962) was based on item difficulty alone. The use of choice-total
correlations is the basis of certain option-weighting
procedures such as those investigated by Davis and Fifer (1959),
Hendrickson (1971), and Sabers and White (1969). It was thought that
by including all of these choice statistics in arriving at a total 
score, a more effective scoring procedure would result than if just 
one such statistic was used. The results of this study indicate
that such was not the case. It may be that the way in which these 
item statistics were combined in this study limited their utility 
for computing test scores. The choice difficulty and discrimination
coefficients associated with every choice marked by an examinee were
summed across all items, and the mean values, for both correct
and incorrect choices, were computed. For different examinees,
these means were computed using different n's, depending on the
number of items answered correctly and on the number of items
omitted. If for discussion we can assume that more credit should
be assigned when difficult items are answered correctly, a potential
difficulty arises. Suppose an examinee answers every item correctly.
His score for score variables V1 and V2 would be the mean correct-choice
statistics for the test. However, if an uninformed examinee
supplies a random guess to every item, and by chance correctly
guesses, say, two of the most difficult items, his score for variables
V1 and V2 would be somewhat higher than that of the well-informed examinee.
If one extends this type of thinking toward the middle range of
ability, it seems reasonable that the first four score variables
(V1, V2, V3, V4) may be greatly affected by chance and by the number
of items over which they are computed. If there were a defensible
way in which these choice statistics could be combined, and perhaps
moderated by number-right scores, to assign individual item scores,
perhaps more valid and reliable scores would result.
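
The dependence of these means on the number of items entering them can be illustrated with a short sketch of V1, V5, and V6 from Table 1. The difficulty values are invented, the function name is hypothetical, and two details are assumptions: V5 is taken as the proportion of all items answered correctly, and V6 is computed as a population variance.

```python
from statistics import mean, pvariance


def correct_choice_variables(correct_difficulty, answered_correctly):
    """Return (V1, V5, V6) for one examinee.

    correct_difficulty -- per-item proportion of examinees marking the key
    answered_correctly -- indices of the items this examinee got right
                          (must be non-empty for the means to be defined)
    """
    values = [correct_difficulty[i] for i in answered_correctly]
    v1 = mean(values)                          # mean over items answered correctly
    v5 = len(values) / len(correct_difficulty) # proportion of items correct
    v6 = pvariance(values)                     # spread of those difficulty values
    return v1, v5, v6


# Hypothetical five-item test; low values are the hard items.
difficulty = [0.9, 0.8, 0.7, 0.3, 0.2]
informed = correct_choice_variables(difficulty, [0, 1, 2, 3, 4])
guesser = correct_choice_variables(difficulty, [3, 4])
```

Here the well-informed examinee's V1 is an average over all five items, while the guesser's V1 rests on only the two hard items that happened to be guessed correctly, so the two values are not comparable without reference to V5.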

Independent of the proposed scoring system, it is of interest 
to consider the psychometric properties of the various guessing 
score sets generated in this study. Inspection of Tables 2 and 3 
reveals successively higher matched-half reliability estimates for 
the partially-guessing-free score sets as they become more nearly
guessing-free. This finding may seem especially unusual in light
of the fact that when guessing scores were computed from these
same data, the split-half reliability estimates were significantly
different from zero in most cases. However, the true-score model
presented by Frary (1969a, 1969b) indicates that the reliability
of the scores may increase or decrease when the guessing component
is removed, depending on the correlation between the true and
guessing components. In order that the reliability increase,
this correlation must be negative and a given inequality must be
satisfied. Use of the appropriate formulas presented by Frary
(1969a) showed both of these conditions to be satisfied. These
findings are therefore consistent with theoretical expectations 
and argue against the notion that score reliability can be expected 
to decrease when the effect of guessing is removed, even though the 
guessing component itself may be reliable. No systematic effect 
on validity was noted when the guessing component was removed as 
indicated by the correlation of the guessing-free and
partially-guessing-free score sets with the two external criteria presented
in the same tables. These data suggest that significant increases 
in reliability can result if score credit is assigned only to items 
that were answered correctly without guessing of any type. This is 
quite different from assigning score credit as required by the 
Coombs' scoring rule which did not appreciably affect the reliability 
or validity of the scores in this study. Of course, there is no 
way to determine the number of items a student knows without
imposing some type of penalty for guessing, or reward for not 
guessing. To advise students of a penalty for guessing and then 
delete the penalty in computing the scores (i.e., assign credit
only to items for which complete knowledge is expressed), would 
be inappropriate. Consequently, these findings are only of 
theoretical interest at present. 

At the very least, the markedly higher reliability estimates 
obtained for the guessing-free and partially-guessing-free score
sets emphasize the potential for any scoring system that reduces 
or eliminates guessing. 

A major assumption on which the proposed scoring procedure 
rests is that the alternate response mode can effectively eliminate
score variance due to guessing. While this assumption holds a
certain intuitive appeal, it may not be reasonable for the type 
of test used in this study. Most of the items used in the experi- 
mental test required the solution to an algebraic problem. The 
distracters represented incorrect solutions that were thought by
the investigator to represent plausible errors. Consequently, if 
the student arrived at an incorrect answer that matched one of the 
choices, he may well have dismissed any doubts he had in the process 
of arriving at that solution and felt confident that his answer was 
right, since it was among the choices. In this case, he probably 
would have marked all remaining choices. At the time he indicated
his answer, he would not have thought he was guessing, even though 
he may have made several guesses in the process of arriving at his 
solution. If this hypothesis about the students' strategies is
true, the various criterion scores which were thought by the 
investigator to reflect varying degrees of guessing may have 
been invalid as such. This possibility was perhaps less likely 
with items that did not require a solution to a problem. 

TABLE 1

OPERATIONAL DEFINITIONS OF THE SIX BASIC VARIABLES

Variable
Number      Definition

V1          The mean difficulty(a) of the correct choices marked
            by the examinee
V2          The mean discrimination coefficient(b) of the correct
            choices marked by the examinee
V3          The mean difficulty(a) of the distracters marked by
            the examinee
V4          The mean discrimination coefficient(b) of the
            distracters marked by the examinee
V5          The proportion of correct choices marked by the
            examinee
V6          The variance of the difficulty values for the correct
            choices marked by the examinee

(a) Difficulty is defined as the proportion of examinees who marked a
particular choice.

(b) Discrimination coefficient is defined as the point-biserial
correlation between marking or not marking a choice and total test
scores uncorrected for overlap.

TABLE 2

DESCRIPTIVE STATISTICS AND INTERCORRELATIONS AMONG
SEVEN SCORE SETS GENERATED FROM THE EXPERIMENTAL TEST

(Group A)

                                 SCORE SETS

                  NR     NRC  COOMBS      GF    GF-2    GF-3    GF-4

NR                      .996    .933    .895    .956    .971    .983
NRC                             .946    .897    .955    .968    .980
COOMBS                                  .910    .953    .957    .952
GF                                              .951    .913    .911
GF-2                                                    .984    .977
GF-3                                                            .993

FE              .651    .668    .667    .648    .648    .652    .661
FG              .619    .642    .697    .653    .630    .636    .639

Mean           10.53    8.35   33.83    8.23    9.66   10.10   10.35
Standard
Deviation       4.25    5.19   20.73    5.02    4.64    4.44    4.35

Split-half      .812            .815    .898    .859    .836    .814
KR-20           .790
Alpha                           .798



TABLE 3

DESCRIPTIVE STATISTICS AND INTERCORRELATIONS AMONG
SEVEN SCORE SETS GENERATED FROM THE EXPERIMENTAL TEST

(Group B)

                                 SCORE SETS

                  NR     NRC  COOMBS      GF    GF-2    GF-3    GF-4

NR                      .987    .954    .912    .962    .976    .978
NRC                             .976    .928    .972    .986    .988
COOMBS                                  .943    .970    .967    .961
GF                                              .952    .930    .919
GF-2                                                    .985    .975
GF-3                                                            .996

FE              .623    .636    .633    .607    .606    .620    .620
FG              .640    .651    .642    .633    .625    .641    .645

Mean           10.66    8.45   34.18    8.58    9.96   10.37   10.48
Standard
Deviation       4.54    5.52   22.05    5.17    4.97    4.72    4.71

Split-half      .841            .864    .904    .876    .854    .852
KR-20           .820
Alpha                           .827

TABLE 4

CORRELATION OF SCORES RESULTING FROM THREE SCORING
METHODS WITH THE GF AND PARTIALLY GF SCORE SETS

                                  CRITERION SCORES
                                GF    GF-2    GF-3    GF-4

Group A
  Number-Right                .895    .956    .971    .983
  Number-Right Corrected      .897    .955    .968    .980
  Proposed System
  (Cross-validated)           .871    .945    .959    .970

Group B
  Number-Right                .912    .962    .976    .978
  Number-Right Corrected      .928    .972    .986    .988
  Proposed System
  (Cross-validated)           .870    .948    .960    .975

TABLE 5

CORRELATION OF SCORES RESULTING FROM THREE SCORING
METHODS WITH EACH OF THE EXTERNAL CRITERIA

                                  Group A             Group B
                              Final     Final     Final     Final
Scoring Method                Exam      Grade     Exam      Grade

Number-Right                  .6509     .6193     .6227     .6397
Number-Right Corrected        .6677     .6416     .6367     .6510
Proposed System
(Cross-validated)             .5319     .4467     .5422     .4899